Read in Data from Chicago

A new survey shows that nearly half of Chicago residents feel “very unsafe” in the city as a whole, and fewer than a quarter of Chicagoans feel safe in the city where they live. In addition, fewer than 30 percent of residents feel safe in their own neighborhoods, while the fall 2021 wave of the same survey found that 45 percent of the public felt safe in their neighborhoods.

A geospatial risk model is a regression model in which the dependent variable is the occurrence of discrete events like crime, fires, etc. Predictions from these models are interpreted as ‘the forecasted risk/opportunity of that event occurring here’.

So I want to examine the pattern of assaults that happen on the street. If we use a geospatial risk model, would it be possible for the police department to place more officers in places where street assault is likely to occur? Could such a model reduce the occurrence of street assault?

To exclude assaults that occur indoors, such as in apartments or other buildings, the dataset only contains assaults that happened on streets, sidewalks, and alleys.

policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 25 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS:  WGS 84
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)
## Reading layer `OGRGeoJSON' from data source 
##   `https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON' 
##   using driver `GeoJSON'
## Simple feature collection with 277 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
## Geodetic CRS:  WGS 84
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

assault <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr")
assPed <- assault %>%
  filter(Primary.Type == "ASSAULT" & (Location.Description == "STREET" | Location.Description == "SIDEWALK" | Location.Description == "ALLEY")) %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
  dplyr::select(-Date, -Updated.On) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
  st_transform('ESRI:102271') %>%
  distinct()

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 
## Reading layer `chicagoBoundary' from data source 
##   `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

Visualizing point data

The maps show that street assaults are concentrated on the West Side and in the center of Chicago.

# uses grid.arrange to organize independent plots
grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "#ffeecc", color = "orange", size = 0.2) +
  geom_sf(data = assPed, colour="orange", size=0.6, show.legend = "point") +
  labs(title= "Pedestrian Assaults, Chicago - 2017") +
  mapTheme(title_size = 14),

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "#ffeecc", color = "orange",size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(assPed)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scico::scale_fill_scico(palette = "lajolla") +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Pedestrian Assaults") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

Creating a fishnet grid

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

Aggregate points to the fishnet

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(assPed) %>% 
  mutate(countassPed = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countassPed = replace_na(countassPed, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))
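
The `cvID` line above assigns each grid cell a random fold label for the k-fold cross-validation used later: drawing from roughly `nrow(fishnet) / 24` labels means each fold ends up with about 24 cells on average. A minimal sketch of the idea (the cell count here is hypothetical):

```r
set.seed(1)
n_cells <- 2232                   # hypothetical number of fishnet cells
n_folds <- round(n_cells / 24)    # ~93 fold labels
cvID    <- sample(n_folds, size = n_cells, replace = TRUE)
mean(table(cvID))                 # each label appears ~24 times on average
```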

ggplot() +
  geom_sf(data = crime_net, aes(fill = countassPed), color = "#ffeecc") +
  scico::scale_fill_scico(palette = "lajolla") +
  labs(title = "Count of Pedestrian Assaults by Fishnet Cell") +
  mapTheme()

Feature Engineering

I choose garbage carts, abandoned cars, abandoned buildings, graffiti, street lights out, and liquor retail as my predictors. Among these indicators there is a possibility of selection bias. The Broken Windows Theory posits a link between community ‘disorder’ and crime: features of the built environment such as abandoned buildings and disrepair may signal a local tolerance for criminality. A host of built-environment risk factors impact crime risk, but many of them suffer from selection bias.

Add two new features

GarbageCarts <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
  mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(fishnet)) %>%
  mutate(Legend = "Garbage_Carts")

abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")


liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")


## Neighborhoods to use in LOGO-CV in a bit
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 
## Reading layer `chicago' from data source 
##   `https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 98 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

Aggregate features to fishnet

vars_net <- 
  rbind(GarbageCarts,abandonCars,abandonBuildings,graffiti,streetLightsOut,liquorRetail) %>%
  st_join(.,fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
  left_join(fishnet, ., by = "uniqueID") %>%
  spread(Legend, count, fill=0) %>%
  dplyr::select(-`<NA>`) %>%
  ungroup()
## `summarise()` has grouped output by 'uniqueID'. You can override using the
## `.groups` argument.
ggplot() +
  geom_sf(data = vars_net, aes(fill = Garbage_Carts), color = "#ffeecc") +
  scico::scale_fill_scico(palette = "lajolla") +
  labs(title = "Count of Garbage Carts by Fishnet Cell") +
  mapTheme()

Nearest Neighbor Feature

I choose k = 3, so each feature measures the average distance from a grid cell’s centroid to its three nearest instances of the risk factor.

# convenience aliases to shorten the function names below
st_c    <- st_coordinates
st_coid <- st_centroid

## create NN features from each risk factor
## Note: for one predictor, nn_function threw the error "no applicable method for 'st_coordinates' applied to an object of class "c('double', 'numeric')"". That error means st_coordinates received a plain numeric vector rather than an sf object, so I dropped the offending predictor.
vars_net <- vars_net %>%
    mutate(Abandoned_Buildings.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(abandonBuildings),
                                           k = 3)) %>%
    mutate(Abandoned_Cars.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(abandonCars),
                                           k = 3)) %>%
    mutate(Garbage_Carts.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(GarbageCarts),
                                           k = 3)) %>%
    mutate(Graffiti.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(graffiti),
                                           k = 3)) %>%
    mutate(Liquor_Retail.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(liquorRetail),
                                           k = 3)) %>%
    mutate(Street_Lights_Out.nn = nn_function(st_c(st_coid(vars_net)), 
                                           st_c(streetLightsOut),
                                           k = 3))
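
`nn_function` comes from the book’s helper script (functions.r). A minimal sketch of what it is assumed to do, using `get.knnx` from the FNN package: for each point in `measureFrom`, find its k nearest points in `measureTo` and return the mean of those k distances (the published version may differ in details):

```r
library(FNN)

# hedged sketch of the helper, not the author's exact implementation
nn_function <- function(measureFrom, measureTo, k) {
  # measureFrom / measureTo are coordinate matrices, e.g. from st_coordinates()
  nn <- get.knnx(measureTo, measureFrom, k = k)
  # average distance to the k nearest neighbors, one value per origin point
  rowMeans(nn$nn.dist)
}
```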

Visualize the NN feature

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

ggplot() +
      geom_sf(data = vars_net.long.nn, aes(fill=value), colour=NA) +
      facet_wrap(~Variable, nrow = 2) +
      scico::scale_fill_scico(palette = "lajolla") +
      labs(title = "Nearest Neighbor Risk Factor Distances") +
      mapTheme()

## Drop the geometry from vars_net before joining its features

final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

Join in areal data

Spatial joins attach each fishnet cell’s centroid to the neighborhood and police-district polygon it falls within.

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()
## Joining, by = "uniqueID"
# for live demo
mapview::mapview(final_net, zcol = "District")

Local Moran’s I for fishnet grid cells

## generates warnings from PROJ issues
## {spdep} converts the fishnet polygons to a neighbor list...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and the neighbor list to a list of spatial weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

 print(final_net.weights, zero.policy=TRUE)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 2237 
## Number of nonzero links: 16904 
## Percentage nonzero weights: 0.3377983 
## Average number of links: 7.556549 
## 2 regions with no links:
## 2203 2267
## 
## Weights style: W 
## Weights constants summary:
##      n      nn   S0       S1       S2
## W 2235 4995225 2235 610.3296 8970.474
## see ?localmoran
local_morans <- localmoran(final_net$countassPed, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(AssPedCount = countassPed, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

Plotting local Moran’s I results

A model fit only to the coldspots will underfit the hotspots, and vice versa. Not only is this local insight useful for exploratory analysis, it can also be engineered into powerful spatial features that control for the local spatial process.

## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange, c(varList, ncol = 4, top = "Local Moran's I statistics, Pedestrian Assault"))

Distance to Hot Spots

In order to further distinguish highly significant grid cells from less important ones, I created a dummy variable flagging cells in highly significant clusters (p < 0.001). I then measured the distance from each grid cell to its nearest highly significant cluster, or what we commonly refer to as a hot spot. Visualizing these distance values shows each cell’s relative proximity to, or distance from, the significant clusters.

# generates warning from NN
final_net <- final_net %>% 
  mutate(abandoned.isSig = 
           ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
  mutate(abandoned.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           abandoned.isSig == 1))), 
                       k = 1))

Plot NN distance to hot spot

The plot shows that distances to the significant clusters are greatest in the far north and south of Chicago. We can now model important information about the local spatial process of pedestrian assaults.

ggplot() +
      geom_sf(data = final_net, aes(fill=abandoned.isSig.dist), colour=NA) +
      scale_fill_viridis(name="NN Distance") +
      labs(title="Distance to Significant Assault Hot Spots") +
      mapTheme()

Correlation Tests

These correlation tests examine the linear relationship between each indicator and the assault count, which helps in choosing between the count and nearest-neighbor version of each risk factor. Based on this graph, I choose Abandoned_Buildings, Abandoned_Cars.nn, Garbage_Carts, Graffiti.nn, Liquor_Retail.nn, and Street_Lights_Out as my predictors.

# View(crossValidate)

## define the variables we want
correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District, -abandoned.isSig, -abandoned.isSig.dist) %>%
    gather(Variable, Value, -countassPed)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countassPed, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countassPed)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 0.8, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Pedestrian assault count as a function of risk factors") +
  plotTheme()
## `geom_smooth()` using formula 'y ~ x'

A histogram of the dependent variable

The graph shows that the counts of assault are heavily skewed toward zero. When data is distributed this way, an OLS regression is inappropriate. In this section, a Poisson regression is estimated instead, which is uniquely suited to modeling a count outcome like this one.

ggplot( final_net, aes(x=countassPed)) +
  geom_histogram( binwidth=1, fill="#ffeecc", color="grey", alpha=0.9) +
  ggtitle("Pedestrian Assaults in Chicago, 2017") +
  theme_ipsum() +
  theme(
    plot.title = element_text(size=15)
  )
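
For reference, the Poisson count model that the cross-validation below relies on can also be fit directly with `glm`. This is a sketch, not the author’s exact call; it assumes the predictor columns named in `reg.vars` later in this section:

```r
# sketch: a Poisson regression of assault counts on the risk factors
pois_fit <- glm(countassPed ~ Abandoned_Buildings + Abandoned_Cars.nn +
                  Graffiti.nn + Liquor_Retail.nn + Street_Lights_Out +
                  Garbage_Carts,
                family = "poisson",
                data = st_drop_geometry(final_net))
summary(pois_fit)
```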

I use two methods to test model performance: random k-fold cross-validation for generalizability to new data, and spatial leave-one-group-out (LOGO) cross-validation for generalizability across spatial contexts like neighborhoods.
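
`crossValidate` is a helper from the book’s functions.r. Conceptually, the spatial LOGO-CV it performs holds out one group (e.g. a neighborhood) at a time, trains a Poisson model on all the other groups, and predicts on the held-out one. A rough sketch of that loop, under the assumption that this is how the helper works (names here are illustrative):

```r
# hedged sketch of leave-one-group-out CV with a Poisson model
logo_cv <- function(dataset, id, dependentVariable, indVariables) {
  allGroups <- unique(st_drop_geometry(dataset)[[id]])
  form <- reformulate(indVariables, response = dependentVariable)
  out <- lapply(allGroups, function(g) {
    train <- dplyr::filter(dataset, .data[[id]] != g)   # all other groups
    test  <- dplyr::filter(dataset, .data[[id]] == g)   # held-out group
    fit   <- glm(form, family = "poisson", data = st_drop_geometry(train))
    dplyr::mutate(test, Prediction = predict(fit, test, type = "response"))
  })
  do.call(rbind, out)   # stitch per-group predictions back together
}
```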

Calculate errors by NEIGHBORHOOD

reg.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Garbage_Carts"
              )

reg.ss.vars <- c("Abandoned_Buildings", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out", "Garbage_Carts", "abandoned.isSig","abandoned.isSig.dist")
reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countassPed",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countassPed, Prediction, geometry)
## This hold out fold is 20
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(id)` instead of `id` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## This hold out fold is 96 
## ... (one "This hold out fold is ..." line per fold omitted)
reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countassPed",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countassPed, Prediction, geometry)
## This hold out fold is 20 
## ... (one "This hold out fold is ..." line per fold omitted)
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countassPed",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countassPed, Prediction, geometry)
## This hold out fold is Riverdale 
## ... (one "This hold out fold is ..." line per neighborhood omitted)
reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countassPed",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countassPed, Prediction, geometry)
## This hold out fold is Riverdale 
## ... (one "This hold out fold is ..." line per neighborhood omitted)
# Bind the out-of-fold predictions from all four regressions into one sf object
reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countassPed,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countassPed,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countassPed,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countassPed,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 


error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    # Note: MAE is computed here as the absolute value of each fold's mean
    # error, so SD_MAE simply duplicates MAE at the fold level
    summarize(Mean_Error = mean(Prediction - countassPed, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") +
    plotTheme()

I compare four regression models in total: Random k-fold CV: Just Risk Factors, Random k-fold CV: Spatial Process, Spatial LOGO-CV: Just Risk Factors, and Spatial LOGO-CV: Spatial Process. Within each cross-validation scheme, the model that includes the spatial process has the lower MAE; the LOGO-CV errors are larger than the random k-fold errors because holding out entire neighborhoods is a more conservative test of how the model generalizes.
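For clarity, the fold-level MAE being compared across these models can be sketched outside of R. The Python snippet below uses made-up fold errors (not the actual values from this analysis) purely to show the computation:

```python
# Illustrative only: compare mean absolute error (MAE) across CV schemes.
# The fold-level errors below are hypothetical, not the report's real folds.
def mae(errors):
    """Mean absolute error over a list of fold-level mean errors."""
    return sum(abs(e) for e in errors) / len(errors)

fold_errors = {
    "Random k-fold CV: Spatial Process": [0.2, -0.5, 0.6, -0.4],
    "Spatial LOGO-CV: Spatial Process":  [1.1, -0.8, 1.3, -0.7],
}

for model, errs in fold_errors.items():
    print(model, round(mae(errs), 3))
```

Averaging the absolute fold errors, rather than the raw errors, keeps over- and under-predictions from cancelling each other out.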

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF") 
| Regression | Mean_MAE | SD_MAE |
|------------|----------|--------|
| Random k-fold CV: Just Risk Factors | 0.47 | 0.34 |
| Random k-fold CV: Spatial Process | 0.43 | 0.30 |
| Spatial LOGO-CV: Just Risk Factors | 1.36 | 1.21 |
| Spatial LOGO-CV: Spatial Process | 1.00 | 0.91 |

The table also shows that when we take the spatial process into account, the MAE decreases under both cross-validation schemes, which suggests that the spatial process features are necessary.
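The "spatial process" features work by letting each grid cell borrow information from its neighbors. A minimal sketch of one such feature, a spatial lag (the mean count among a cell's adjacent cells), using a hypothetical grid of counts:

```python
# Minimal sketch: a spatial-lag feature on a grid, where each cell's feature
# is the mean event count among its 4 rook-adjacent neighbors.
# The grid layout and counts are hypothetical.
def spatial_lag(grid):
    rows, cols = len(grid), len(grid[0])
    lag = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the counts of in-bounds neighbors only
            nbrs = [grid[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols]
            lag[r][c] = sum(nbrs) / len(nbrs)
    return lag

counts = [[0, 1, 0],
          [2, 5, 1],
          [0, 2, 0]]
print(spatial_lag(counts)[1][1])  # mean of 1, 2, 1, 2 = 1.5
```

A regression that includes such a lag (or related statistics like local Moran's I) can absorb spatial clustering that the risk factors alone miss, which is consistent with the MAE drop in the table.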

error_by_reg_and_fold %>%
  filter(str_detect(Regression, "LOGO")) %>%
  ggplot() +
    geom_sf(aes(fill = MAE)) +
    facet_wrap(~Regression) +
    scale_fill_viridis() +
    labs(title = "PedAssault errors by LOGO-CV Regression") +
    mapTheme() + theme(legend.position="bottom")

# Fold-level (neighborhood) errors for the spatial LOGO-CV model only
error_by_reg_and_fold <- 
  reg.ss.spatialCV %>%
    group_by(cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countassPed, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>% 
  arrange(desc(MAE))
## Simple feature collection with 96 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 96 × 5
##    cvID            Mean_Error   MAE SD_MAE                              geometry
##    <chr>                <dbl> <dbl>  <dbl>                         <POLYGON [m]>
##  1 Rush & Division      -5.44  5.44   5.44 ((358664.6 580874.5, 358164.6 580874…
##  2 Loop                 -4.53  4.53   4.53 ((357664.6 579374.5, 357664.6 579874…
##  3 Rogers Park          -3.34  3.34   3.34 ((354164.6 593374.5, 353664.6 593374…
##  4 United Center        -2.77  2.77   2.77 ((355164.6 578874.5, 355164.6 578374…
##  5 Washington Park      -2.36  2.36   2.36 ((358664.6 569374.5, 358664.6 569874…
##  6 Englewood             2.31  2.31   2.31 ((355164.6 565374.5, 354664.6 565374…
##  7 Grant Park           -2.29  2.29   2.29 ((358664.6 578374.5, 358664.6 578874…
##  8 South Shore          -2.25  2.25   2.25 ((365164.6 564874.5, 364664.6 564874…
##  9 Printers Row         -2.24  2.24   2.24 ((358664.6 577374.5, 358164.6 577374…
## 10 Chinatown             2.09  2.09   2.09 ((357164.6 575874.5, 357664.6 575874…
## # … with 86 more rows
error_by_reg_and_fold %>% 
  arrange(MAE)
## Simple feature collection with 96 features and 4 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: 341164.6 ymin: 552874.5 xmax: 367164.6 ymax: 594874.5
## Projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 96 × 5
##    cvID               Mean_Error     MAE  SD_MAE                        geometry
##    <chr>                   <dbl>   <dbl>   <dbl>                   <POLYGON [m]>
##  1 South Deering         0.00450 0.00450 0.00450 ((361164.6 555374.5, 361164.6 …
##  2 Sheffield & DePaul   -0.0150  0.0150  0.0150  ((356664.6 583874.5, 356164.6 …
##  3 Oakland              -0.0359  0.0359  0.0359  ((361164.6 572874.5, 361164.6 …
##  4 Lake View             0.0370  0.0370  0.0370  ((356164.6 584874.5, 355664.6 …
##  5 Greektown             0.0410  0.0410  0.0410  ((357164.6 578374.5, 356664.6 …
##  6 Auburn Gresham       -0.0492  0.0492  0.0492  ((354664.6 563374.5, 354664.6 …
##  7 Pullman              -0.0664  0.0664  0.0664  ((360164.6 558874.5, 360164.6 …
##  8 East Side             0.0728  0.0728  0.0728  ((365664.6 558874.5, 365664.6 …
##  9 West Loop            -0.123   0.123   0.123   ((357664.6 576874.5, 357164.6 …
## 10 Bridgeport            0.140   0.140   0.140   ((356164.6 573874.5, 355664.6 …
## # … with 86 more rows
# plot histogram of OOF (out-of-fold) errors
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  scale_x_continuous(breaks = seq(0, 11, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "LOGO-CV",
         x="Mean Absolute Error", y="Count") 

Density vs predictions

# demo of kernel width
pedass_ppp <- as.ppp(st_coordinates(assPed), W = st_bbox(final_net))
pedass_KD.1000 <- spatstat.core::density.ppp(pedass_ppp, 1000)
pedass_KD.df <-
  data.frame(rasterToPoints(mask(raster(pedass_KD.1000), as(neighborhoods, 'Spatial'))))

as.data.frame(pedass_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(assPed, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2017 pedestrian assaults") +
     mapTheme(title_size = 14)
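For intuition, a kernel density surface is just a sum of smooth bumps centered on each event, with the bandwidth controlling how far each bump spreads. A one-dimensional Gaussian-kernel sketch (the event locations and bandwidth below are hypothetical):

```python
import math

def gaussian_kde_1d(points, x, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate at location x."""
    n = len(points)
    return sum(
        math.exp(-((x - p) / bandwidth) ** 2 / 2)
        / (bandwidth * math.sqrt(2 * math.pi))
        for p in points
    ) / n

events = [1.0, 1.2, 1.1, 5.0]   # hypothetical event locations
# Density is higher near the cluster at ~1 than near the lone event at 5
print(gaussian_kde_1d(events, 1.1, 0.5) > gaussian_kde_1d(events, 5.0, 0.5))
```

The `density.ppp` call above does the same thing in two dimensions, with a 1000-foot bandwidth, which is why clusters of past assaults show up as hotspots on the map.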

Get 2018 crime data

Let’s see how our model performed relative to kernel density (KD) on the following year’s data.

assPed18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
  filter(Primary.Type == "ASSAULT" & (Location.Description == "STREET" | Location.Description == "SIDEWALK" | Location.Description == "ALLEY")) %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  dplyr::select(-Date, -Updated.On) %>%
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]
pedass_KDE_sum <- as.data.frame(pedass_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(pedass_KDE_sum$value, 
                             n = 5, "fisher")
pedass_KDE_sf <- pedass_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(assPed18) %>% mutate(countassped = 1), ., sum) %>%
    mutate(countassped = replace_na(countassped, 0))) %>%
  dplyr::select(label, Risk_Category, countassped)



ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
pedass_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(assPed18) %>% mutate(countassped = 1), ., sum) %>%
      mutate(countassped = replace_na(countassped, 0))) %>%
  dplyr::select(label,Risk_Category, countassped)
rbind(pedass_KDE_sf, pedass_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(assPed18, 3000), size = .3, colour = "black") +
    facet_wrap(~label) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 pedestrian assault risk predictions; 2018 pedestrian assaults") +
    mapTheme(title_size = 14)
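The `Risk_Category` assignment above relies on Fisher breaks from `classInt`, but the underlying idea of binning values by break points is simple. A sketch using hypothetical break points rather than the fitted Fisher breaks:

```python
import bisect

def risk_category(value, breaks, labels):
    """Assign a label by locating value among ascending break points."""
    # bisect_right returns the number of breaks that value exceeds,
    # which indexes directly into the ordered labels
    return labels[bisect.bisect_right(breaks, value)]

# Hypothetical upper break points separating categories 1st..5th
breaks = [0.5, 1.0, 2.0, 4.0]
labels = ["1st", "2nd", "3rd", "4th", "5th"]
print(risk_category(0.3, breaks, labels))   # -> 1st
print(risk_category(3.0, breaks, labels))   # -> 4th
print(risk_category(10.0, breaks, labels))  # -> 5th
```

Fisher breaks place the cut points to minimize within-class variance rather than at fixed values, but the classification step works the same way.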

My model is better suited to predicting areas around the 2nd risk category; its accuracy in high-density assault areas is low.

rbind(pedass_KDE_sf, pedass_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countPedass = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countPedass / sum(countPedass)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, name = "Model") +
      labs(title = "Risk prediction vs. Kernel density, 2018 CountPedass",
           y = "% of Test Set CountPedass (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
## `summarise()` has grouped output by 'label'. You can override using the
## `.groups` argument.

The bar chart supports the same conclusion as the maps. In lower-density areas, the risk predictions perform better than kernel density; unfortunately, in high-density areas they do not perform well.
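The quantity behind the bar chart, the share of next-year events that fall in each risk category for a given model, can be sketched as follows (the per-event categories below are hypothetical):

```python
from collections import Counter

def capture_rates(category_per_event):
    """Fraction of test-set events falling in each risk category."""
    counts = Counter(category_per_event)
    total = sum(counts.values())
    return {cat: counts[cat] / total for cat in sorted(counts)}

# Hypothetical: the risk category of the cell containing each 2018 assault
events = ["5th", "5th", "4th", "3rd", "5th", "2nd", "4th", "5th", "1st", "5th"]
print(capture_rates(events))
```

A model is useful for allocation when most next-year events land in its highest risk categories; a large share stranded in the lower categories, as in this analysis, signals under-prediction of the true hotspots.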

Conclusion

In this assignment, I set out to build a geospatial risk prediction model that borrows the pedestrian-assault experience from places where it has been observed and tests whether that experience generalizes to places where pedestrian-assault risk may be high despite few recorded events, and then to ask whether such predictions could be used to allocate police response across space.

Following the statement above, I take the spatial structure into account in the prediction model. The geospatial risk prediction model contains six predictors. The cross-validation results show that the spatial process should be taken into account, because the MAE decreases once the spatial structure features are added, and the distribution of MAE shows that the model has removed some of the spatial error. Consequently, the accuracy tests based on mean error suggest a fair amount of accuracy in our models, especially the one with the spatial structure.

I would not recommend putting my algorithm into production. The first reason is that it has low accuracy in high-assault areas, which means the police could not predict the occurrence of assaults with maximum efficiency, even though the model predicts well in middle- and low-density areas. The second reason is that the maximum share of test-set events captured by any risk category is around 0.4, meaning that police acting on these predictions would anticipate at most about 40 percent of assaults. That does not seem like an outcome the police would accept.

One reason for this result could be that I did not extract the key spatial-structure indicators as independent variables for the regression.